Relationships Between Variables
Visualizing Linear Regression
Characterizing Relationships
Form (e.g. linear, quadratic, non-linear)
Direction (e.g. positive, negative)
Strength (how much scatter/noise?)
Unusual observations (do points not fit the overall pattern?)
Data for Today
The ncbirths dataset is a random sample of 1,000 cases taken from a larger dataset collected in North Carolina in 2004.
Each case describes the birth of a single child born in North Carolina, along with characteristics of the child (e.g. birth weight, length of gestation), the child’s mother (e.g. age, weight gained during pregnancy, smoking habits), and the child’s father (e.g. age).
Your Turn!
How would you characterize this relationship?
What if you added another variable?
Correlation:
strength and direction of a linear relationship between two quantitative variables
Anscombe Correlations
Four datasets, very different graphical presentations
For which of these relationships is correlation a reasonable summary measure?
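One way to see why the plots matter is to compute the correlation for each of the four datasets. The sketch below uses the standard published Anscombe quartet values and a hand-written Pearson correlation; the function name `pearson_r` is illustrative and not part of the slides. All four correlations come out close to 0.82, even though the scatterplots look nothing alike.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

anscombe = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

correlations = {name: pearson_r(xs, ys) for name, (xs, ys) in anscombe.items()}
for name, r in correlations.items():
    print(f"Dataset {name}: r = {r:.3f}")
```

Identical correlations do not imply identical relationships, so the number is only a reasonable summary where the scatterplot looks roughly linear.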
The Importance of Language
The word “correlation” has both a precise mathematical definition and a more general definition for typical usage in English.
These uses are obviously related and generally in sync.
There are times when these two uses can be conflated and/or misconstrued.
Linear regression:
we assume the relationship between our response variable (\(y\)) and explanatory variable (\(x\)) can be modeled with a linear function, plus some random noise
\(response = intercept + slope \cdot explanatory + noise\)
Population Model
\(y = \beta_0 + \beta_1 \cdot x + \epsilon\)
\(y\) = response
\(\beta_0\) = population intercept
\(\beta_1\) = population slope
\(\epsilon\) = random errors (noise around the population line; the residuals are their sample counterparts)
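To make the population model concrete, here is a minimal simulation sketch. The parameter values, the x values, and the choice of normally distributed noise are all assumptions made for illustration; the slides only say "random noise".

```python
import random

def simulate_population(xs, beta0, beta1, noise_sd, seed=0):
    """Generate y = beta0 + beta1 * x + epsilon for each x.

    epsilon is drawn from Normal(0, noise_sd); the normal shape is an
    assumption of this sketch, not something the model statement requires.
    """
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0, noise_sd) for x in xs]

# Hypothetical population parameters and x values.
beta0, beta1 = -5.0, 0.3
xs = list(range(30, 43))
ys = simulate_population(xs, beta0, beta1, noise_sd=1.0)

# Each simulated error is the vertical distance from the population line.
errors = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
for x, y, e in zip(xs, ys, errors):
    print(f"x = {x}, y = {y:.2f}, error = {e:+.2f}")
```

Setting `noise_sd=0` puts every point exactly on the line; larger values produce more scatter, which is what "strength" describes.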
Sample Model
\(\widehat{y} = b_0 + b_1 \cdot x\)
Why does this equation have a hat on \(y\)?
Obtaining Coefficient Estimates
Step 1: Fit a linear regression
Our focus (for now…)
Estimated regression equation
\[\widehat{y} = b_0 + b_1 \cdot x\]
# A tibble: 2 × 7
  term      estimate std_error statistic p_value lower_ci upper_ci
  <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept   -5.34      0.565     -9.45       0   -6.45    -4.23
2 weeks        0.325     0.015     22.2        0    0.296    0.354
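Estimates like \(b_0\) and \(b_1\) come from fitting a line to the sample. The sketch below assumes the fit is ordinary least squares (the slides do not name the method) and uses a small made-up dataset, not the ncbirths data. The function name `least_squares` is illustrative.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept: the line passes through the means
    return b0, b1

# Made-up data scattered around the line y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.9, 5.1, 7.0, 9.1, 10.9]

b0, b1 = least_squares(xs, ys)
print(f"y-hat = {b0:.3f} + {b1:.3f} * x")
```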
Write out the estimated regression equation!
How do you interpret the intercept value of -5.341?
How do you interpret the slope value of 0.325?
Obtaining Residuals
\(\widehat{weight} = -5.341 + 0.325 \cdot weeks\)
What would the residual be for a pregnancy that lasted 39 weeks and whose baby weighed 7.63 pounds?
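A residual is the observed value minus the predicted value. The sketch below applies that definition to the estimated equation above, taking the coefficients from the regression table; the function names are illustrative.

```python
def predict_weight(weeks, b0=-5.341, b1=0.325):
    """Predicted birth weight (pounds) from length of gestation (weeks)."""
    return b0 + b1 * weeks

def residual(observed, predicted):
    """Residual = observed - predicted."""
    return observed - predicted

predicted = predict_weight(39)
resid = residual(7.63, predicted)
print(f"predicted = {predicted:.3f} lb, residual = {resid:.3f} lb")
```

A positive residual means the baby weighed more than the line predicts for that gestation length; a negative residual means less.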
distinct levels
Step 2: Fit a linear regression
Step 3: Obtain coefficient table
🤔
# A tibble: 2 × 7
  term          estimate std_error statistic p_value lower_ci upper_ci
  <chr>            <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept         7.23     0.047    155.     0        7.14     7.32
2 habit: smoker    -0.4      0.13      -3.07   0.002   -0.656   -0.145
\[\widehat{weight} = 7.23 - 0.4 \cdot Smoker\]
But what does \(Smoker\) represent???
Indicator Variables
\(x\) is a categorical variable with levels:
"nonsmoker", "smoker"
We need to convert to:
Based on the regression table, what habit group was chosen to be the baseline?
\[\widehat{weight} = 7.23 - 0.4 \cdot 1_{smoker}(x)\]
where
\(1_{smoker}(x) = 1\) if the mother was a "smoker"
\(1_{smoker}(x) = 0\) if the mother was a "nonsmoker"
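The indicator function can be sketched in a few lines. The function names here are illustrative, and the coefficients are the rounded estimates from the regression table. Regression software typically performs this 0/1 conversion automatically.

```python
def smoker_indicator(habit):
    """Return 1 for "smoker" and 0 for "nonsmoker" (the baseline level)."""
    if habit not in ("smoker", "nonsmoker"):
        raise ValueError(f"unexpected habit level: {habit!r}")
    return 1 if habit == "smoker" else 0

def predicted_weight(habit, b0=7.23, b1=-0.4):
    """Estimated mean birth weight (pounds) for a given habit level."""
    return b0 + b1 * smoker_indicator(habit)

# Converting a categorical column into 0/1 values.
habits = ["nonsmoker", "smoker", "nonsmoker"]
indicators = [smoker_indicator(h) for h in habits]
print(indicators)
```

Because the indicator is 0 for the baseline group, the slope term drops out for that group and only the intercept remains.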
\[\widehat{weight} = 7.23 - 0.4 \cdot 1_{smoker}(x)\]
Given the equation, what is the estimated mean birth weight for nonsmoking mothers?
For smoking mothers?